Optimisation of corpus-derived probabilistic grammars
Author
Abstract
This paper examines the usefulness of corpus-derived probabilistic grammars as a basis for the automatic construction of grammars optimised for a given parsing task. Initially, a probabilistic context-free grammar (PCFG) is derived by a straightforward derivation technique from the Wall Street Journal (WSJ) Corpus, and a baseline is established by testing the resulting grammar on four different parsing tasks. In the first optimisation step, different kinds of local structural context (LSC) are incorporated into the basic PCFG. Improved parsing results demonstrate the usefulness of the added structural context information. In the second optimisation step, LSC-PCFGs are optimised in terms of grammar size and performance for a given parsing task. Tests show that significant improvements can be achieved by the method proposed.

The structure of this paper is as follows. Section 2 discusses the practical and theoretical questions and issues addressed by the research presented in this paper, and cites existing research and results in the same and related areas. Section 3 describes how LSC-grammars are derived from corpora, defines the four parsing tasks on which grammars are tested, describes data and evaluation methods used, and presents a baseline technique and baseline results. Section 4 discusses and describes different types of LSC and demonstrates their effect on rule probabilities. Methods for deriving four different LSC-grammars from the corpus are described, and results for the four parsing tasks are presented. It is shown that all four types of LSC investigated improve results, but that some lead to overspecialisation of grammars. Section 5 shows that LSC-grammars can be optimised for grammar size by a generalisation technique that at the same time seeks to optimise parsing performance for a given parsing task.
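The two ingredients described above, relative-frequency PCFG estimation from a treebank and parent-node annotation as one form of LSC, can be sketched as follows. The toy treebank, the tuple encoding of trees, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter, defaultdict

# Toy treebank: a tree is (label, child, ...); leaves are plain strings.
treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "dog")),
          ("VP", ("VB", "barks"))),
    ("S", ("NP", ("NN", "rain")),
          ("VP", ("VB", "falls"))),
]

def annotate_parent(tree, parent=None):
    """Append the parent label to every nonterminal (NP -> NP^S),
    a simple form of local structural context (LSC)."""
    label, *children = tree
    new_label = f"{label}^{parent}" if parent else label
    new_children = [c if isinstance(c, str) else annotate_parent(c, label)
                    for c in children]
    return (new_label, *new_children)

def extract_rules(tree, counts):
    """Count every CFG rule occurring in the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            extract_rules(c, counts)

def estimate_pcfg(trees):
    """Relative-frequency (maximum-likelihood) rule probabilities:
    P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for t in trees:
        extract_rules(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

pcfg = estimate_pcfg([annotate_parent(t) for t in treebank])
```

With parent annotation, the two NP expansions are conditioned on NP^S rather than bare NP, which is how LSC reshapes rule probabilities without changing the underlying rules.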
An automatic search method is described that carries out a search for optimal generalisations of the given grammar in the space of partitions of nonterminal sets. First results are presented for the automatic search method that show that it can be used to reduce grammar size and improve parsing performance. Parent node information is shown to be a particularly useful type of LSC, and the results for the complete parsing task achieved with the corresponding grammar are better than any previously published results for comparable unlexicalised grammars. Preliminary tests for LSC grammar optimisation show that it can drastically reduce grammar size and significantly improve parsing performance. In one set of experiments, a partition was found that increased the labelled F-Score for the complete parsing task from 72.31 to 74.61, while decreasing grammar size from 21,995 rules and 1,104 nonterminals to 11,254 rules and 224 nonterminals. Results for grammar optimisation by automatic search of the partition space show that improvements in grammar size and parsing performance can be achieved in this way, but do not come close to the big improvements achieved in preliminary tests. It is concluded that more sophisticated search techniques are required to achieve this.
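The generalisation step can be pictured as moving through the space of partitions of the nonterminal set: each candidate partition merges some annotated nonterminals back together, and rule counts are re-aggregated before probabilities are re-estimated. The sketch below shows only that merge-and-reaggregate step; the counts, labels, and function name are hypothetical, and a full search would wrap this in a loop that scores each candidate partition on the target parsing task and keeps improvements.

```python
from collections import Counter

def merge_nonterminals(rule_counts, partition):
    """Relabel nonterminals via a partition map (old label -> block label)
    and re-aggregate rule counts; rules that become identical are summed."""
    merged = Counter()
    for (lhs, rhs), n in rule_counts.items():
        new_lhs = partition.get(lhs, lhs)
        new_rhs = tuple(partition.get(s, s) for s in rhs)
        merged[(new_lhs, new_rhs)] += n
    return merged

# Hypothetical counts: collapsing two parent-annotated variants of NP.
counts = Counter({
    ("NP^S", ("DT", "NN")): 5,
    ("NP^VP", ("DT", "NN")): 3,
    ("NP^VP", ("NN",)): 2,
})
partition = {"NP^S": "NP", "NP^VP": "NP"}
merged = merge_nonterminals(counts, partition)
# The grammar shrinks from 3 rules to 2: NP -> DT NN and NP -> NN.
```

Because merging can only shrink or preserve the rule set, every partition trades specificity (the LSC distinctions) against grammar size, which is exactly the trade-off the automatic search explores.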
Related papers
Studying impressive parameters on the performance of Persian probabilistic context free grammar parser
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of treebank data has been important ever since the first large-scale treebank, the Penn Treebank, was published. However, although originating in computational linguistics, the value of treebanks is becoming more widely appreciated in linguistics research as a whole. F...
On the use of probabilistic grammars in speech annotation and segmentation tasks
The present paper explores the issue of corpus prosodic parsing in terms of prosodic words. This question is of importance in both speech processing and corpus annotation studies. We propose a method grounded on both statistical and symbolic (phonological) representations of tonal phenomena and we have recourse to probabilistic grammars, within which we implement a minimal prosodic hierarchica...
Probabilistic Context-Free Grammars for Phonology
We present a phonological probabilistic contextfree grammar, which describes the word and syllable structure of German words. The grammar is trained on a large corpus by a simple supervised method, and evaluated on a syllabification task achieving 96.88% word accuracy on word tokens, and 90.33% on word types. We added rules for English phonemes to the grammar, and trained the enriched grammar o...
On the Utility of Curricula in Unsupervised Learning of Probabilistic Grammars
We examine the utility of a curriculum (a means of presenting training samples in a meaningful order) in unsupervised learning of probabilistic grammars. We introduce the incremental construction hypothesis that explains the benefits of a curriculum in learning grammars and offers some useful insights into the design of curricula as well as learning algorithms. We present results of experiments...
Count-based State Merging for Probabilistic Regular Tree Grammars
We present an approach to obtain language models from a tree corpus using probabilistic regular tree grammars (prtg). Starting with a prtg only generating trees from the corpus, the prtg is generalized step by step by merging nonterminals. We focus on bottom-up deterministic prtg to simplify the calculations.
Published: 2001